NLP

Deep Residual Output Layers for Neural Language Generation

This paper addresses modeling the structure of the output label space, especially when the number of labels is large and the data is sparse, by learning that structure with a deep residual mapping network. ICML 2019.

paper link
code link

Introduction

Learning the structure of the output space benefits many tasks, for example zero-shot classification. When the output space is very large or the data is sparse, treating labels as mutually independent classes makes prediction hard, because the model cannot draw on information from other labels when predicting a given one. Learning label embeddings addresses this: similar labels can reinforce each other and even support zero-shot classification. This kind of output-space modeling is particularly well suited to natural language generation, since word embeddings themselves already provide a good measure of similarity over the label space.

Existing language generation methods mostly predict the next word with a log-linear classifier (softmax). The label weights (i.e., the rows of the softmax matrix $\mathbf{W}$) can be viewed as word vectors: the input encoder maps the context into a vector in the same space, an inner product measures the similarity between the input vector and each label vector in the joint input-label space, and a softmax function is applied at the end. Although word embeddings can serve as the label vectors _(Tying word vectors and word classifiers: A loss framework for language modeling; Using the output embedding to improve language models)_, there is still no parameter sharing between different words, which limits the model's ability to transfer. More recently, _Improving tied architectures for language modelling_ uses a bilinear mapping to share parameters across outputs, and _Beyond weight tying: Learning joint input-output embeddings for neural machine translation_ uses a dual nonlinear mapping to strengthen the classifier.

This paper proposes a method for learning output label encodings in the joint input-label space: a deep residual nonlinear mapping from word embeddings to the joint input-output space, which captures the structure of the output space effectively while avoiding overfitting. The structure of the input encoder and the softmax inner-product operation are left unchanged.

Background

Neural Language Generation

The output at time step $t$ is computed as:
$$
p\left(\mathbf{y}_{t} | \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(\mathbf{W}^{T} \mathbf{h}_{t}+\mathbf{b}\right)
$$
where $\mathbf{W} \in \mathbb{R}^{d_{h} \times|\mathcal{V}|}$, and the class parameters $\mathbf{W}_{i}^{T}$ of the $i$-th label and $\mathbf{W}_{j}^{T}$ of the $j$-th label are learned independently of each other.
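As a concrete illustration, below is a minimal PyTorch sketch of this standard output layer, with one independent weight vector and bias per label; the module and argument names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class SoftmaxOutputLayer(nn.Module):
    """Standard log-linear output layer: one independent weight vector per label."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        # Corresponds to W^T and b above: a (vocab_size x hidden_dim) matrix plus a per-label bias.
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: (batch, hidden_dim) -> log-probabilities over the vocabulary (batch, vocab_size)
        return torch.log_softmax(self.proj(h_t), dim=-1)
```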

Weight Tying

The structure of the output space can be learned by tying the output weights to the input word embedding matrix:
$$
p\left(\mathbf{y}_{t} | \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(\mathbf{E} \mathbf{h}_{t}+\mathbf{b}\right)
$$
where $\mathbf{E}\in \mathbb{R}^{|\mathcal{V}|\times d}$ is the word embedding matrix. This method learns the output structure implicitly.
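A minimal weight-tying sketch, assuming the embedding dimension equals the decoder hidden size (the class name is mine, not the authors'):

```python
import torch
import torch.nn as nn

class TiedOutputLayer(nn.Module):
    """Output layer whose weight matrix is the input embedding matrix E (weight tying)."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)      # E, shape (|V|, d), shared with the input side
        self.bias = nn.Parameter(torch.zeros(vocab_size))   # b

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # logits = E h_t + b; gradients flow into E from both its input and output uses.
        return h_t @ self.embedding.weight.t() + self.bias
```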

Bilinear Mapping

For zero-shot text classification, earlier work proposed the following form to explicitly learn the relation between inputs and outputs; the key is the shared parameter matrix $\mathbf{W}_{1}$:
$$
p\left(\mathbf{y}_{t} | \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(\mathbf{E} \mathbf{W}_{1} \mathbf{h}_{t}+\mathbf{b}\right)
$$
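A sketch of this bilinear scoring in PyTorch (shapes and initialization are my assumptions):

```python
import torch
import torch.nn as nn

class BilinearOutputLayer(nn.Module):
    """Bilinear joint input-output space: scores = E W1 h_t + b, with W1 shared across labels."""
    def __init__(self, vocab_size: int, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)         # E, (|V|, d)
        self.W1 = nn.Parameter(torch.empty(emb_dim, hidden_dim))   # shared bilinear map, (d, d_h)
        nn.init.xavier_uniform_(self.W1)
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t (batch, d_h) times (E W1)^T (d_h, |V|) -> (batch, |V|)
        return h_t @ (self.embedding.weight @ self.W1).t() + self.bias
```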

Dual Nonlinear Mapping

_Beyond weight tying: Learning joint input-output embeddings for neural machine translation_ proposes learning the structure of the outputs and of the context through two separate nonlinear mappings, one applied to the label embeddings and one to the context representation (see the hedged sketch below).
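A minimal sketch of this dual nonlinear variant, assuming both mappings are single nonlinear layers of the same form as the $f_{out}$ layers defined in the next section; refer to Pappas et al. (2018) for the exact formulation.

```python
import torch
import torch.nn as nn

class DualNonlinearOutputLayer(nn.Module):
    """Dual nonlinear joint space (sketch): scores = g_out(E) g_in(h_t) + b."""
    def __init__(self, vocab_size: int, emb_dim: int, hidden_dim: int, joint_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.out_map = nn.Linear(emb_dim, joint_dim)     # nonlinear map over label embeddings
        self.in_map = nn.Linear(hidden_dim, joint_dim)   # nonlinear map over the context vector
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        labels = torch.tanh(self.out_map(self.embedding.weight))  # g_out(E): (|V|, joint_dim)
        context = torch.tanh(self.in_map(h_t))                    # g_in(h_t): (batch, joint_dim)
        return context @ labels.t() + self.bias
```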

Deep Residual Output Layers

Figure 1. General overview of the proposed architecture.

The proposed Deep Residual Output Layers are based on:
$$
p\left(\mathbf{y}_{t} | \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(g_{o u t}(\mathbf{E}) g_{i n}\left(\mathbf{h}_{t}\right)+\mathbf{b}\right)
$$
$g_{in}(\cdot)$ takes the context representation $\mathbf{h}_{t}$ as input (in this paper the authors set $g_{in}(\cdot)=\mathcal{I}$, the identity), while $g_{out}(\cdot)$ takes the descriptions of all labels as input and encodes them into the label embedding $\mathbf{E}^{(k)}$, where $k$ is the number of layers.
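This general scoring rule can be expressed as a small helper that takes arbitrary $g_{out}$ / $g_{in}$ callables (a sketch; the function name is mine):

```python
import torch

def joint_space_logits(E, h_t, g_out, g_in, bias):
    """Unnormalized scores g_out(E) g_in(h_t) + b over all labels.
    E: (|V|, d) label embeddings; h_t: (batch, d_h) context; bias: (|V|,)."""
    labels = g_out(E)      # (|V|, joint_dim)
    context = g_in(h_t)    # (batch, joint_dim)
    return context @ labels.t() + bias
```

With $g_{in}$ set to the identity, as in this paper, the joint dimension must equal the decoder hidden size $d_h$.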

Label Encoder Network: for natural language generation the output labels are the words of the vocabulary, so in this paper the word embeddings are used directly as the input representation of each label.

In general, there may be additional information about each label, such as dictionary entries, cross-lingual resources, or contextual information, in which case we can add an initial encoder for these descriptions which outputs a label embedding matrix.

To encode the structure of the output space, $g_{out}(\cdot)$ is defined as a $k$-layer network that takes the label embedding matrix $\mathbf{E}$ (i.e., the word embeddings) as input:
$$
\mathbf{E}^{(k)}=f_{o u t}^{(k)}\left(\mathbf{E}^{(k-1)}\right)
$$
where each layer $f_{out}^{(i)}$ is defined as:
$$
f_{o u t}^{(i)}\left(\mathbf{E}^{(i-1)}\right)=\sigma\left(\mathbf{E}^{(i-1)} \mathbf{U}^{(i)}+\mathbf{b}_{u}^{(i)}\right)
$$
with $\sigma$ a nonlinear activation function.

The authors further add residual connections:
$$
\mathbf{E}^{(k)}=f_{o u t}^{(k)}\left(\mathbf{E}^{(k-1)}\right)+\mathbf{E}^{(k-1)}+\mathbf{E}
$$
Figure 2. The proposed deep residual label network architecture for neural language generation. Straight lines represent the input to a function and curved lines represent shortcut or residual connections implying addition operations.

To prevent overfitting, the authors apply dropout:
$$
f_{\text {out}}^{\prime(i)}\left(\mathbf{E}^{(i-1)}\right)=\delta\left(f_{\text {out}}^{(i)}\left(\mathbf{E}^{(i-1)}\right)\right) \odot f_{\text {out}}^{(i)}\left(\mathbf{E}^{(i-1)}\right)
$$
where $\delta(\cdot)$ denotes the dropout mask.
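Putting the pieces together, here is a minimal sketch of the deep residual label encoder and the resulting output layer, assuming the residual form above is applied at every layer, with a sigmoid as a placeholder activation and `nn.Dropout` standing in for the mask $\delta(\cdot)$ (note that `nn.Dropout` also rescales by $1/(1-p)$); class and argument names are mine, not the authors'.

```python
import torch
import torch.nn as nn

class DeepResidualLabelEncoder(nn.Module):
    """Sketch of g_out: a k-layer residual network over the label embedding matrix E."""
    def __init__(self, emb_dim: int, num_layers: int, dropout: float = 0.5):
        super().__init__()
        # Each layer realizes f_out^(i)(E) = sigma(E U^(i) + b_u^(i)).
        self.layers = nn.ModuleList([nn.Linear(emb_dim, emb_dim) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        # E: (|V|, d) label (word) embeddings; E^(0) = E.
        prev = E
        for layer in self.layers:
            f = self.dropout(torch.sigmoid(layer(prev)))   # f'_out^(i)(E^(i-1))
            prev = f + prev + E                             # shortcuts from the previous layer and from E
        return prev                                         # E^(k)


class DeepResidualOutputLayer(nn.Module):
    """Full output layer: scores = g_out(E) h_t + b, with g_in = identity as in the paper."""
    def __init__(self, vocab_size: int, emb_dim: int, num_layers: int, dropout: float = 0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # E, shared with the input embeddings
        self.g_out = DeepResidualLabelEncoder(emb_dim, num_layers, dropout)
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        labels = self.g_out(self.embedding.weight)   # E^(k): (|V|, d)
        return h_t @ labels.t() + self.bias          # unnormalized scores over the vocabulary
```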

Experiments

Language Modeling

More specifically, because low frequency words lack data to individually learn the complex structure of the output space, transfer of learned information from other words is crucial to improving performance, whereas this is not the case for higher frequency words. This analysis suggests that our model could also be useful for zero-resource scenarios, where labels need to be predicted without any training data, similarly to other joint input-output space models.

Neural Machine Translation

References

  • _Inan, H., Khosravi, K., and Socher, R. Tying word vectors and word classifiers: A loss framework for language modeling. In International Conference on Learning Representations (ICLR), 2017._
  • _Press, O. and Wolf, L. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157–163, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E17-2025._
  • _Gulordava, K., Aina, L., and Boleda, G. How to represent a word and predict it, too: Improving tied architectures for language modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2936–2941, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D18-1323._
  • _Pappas, N., Miculicich, L., and Henderson, J. Beyond weight tying: Learning joint input-output embeddings for neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 73–83. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/W18-6308._